Nowadays, when people talk about the rise of our planet's average surface temperature, they inevitably mention carbon dioxide and other greenhouse gases (GHGs). We can easily check the latest CO2 data using Python. The CO2 data can be downloaded from NOAA ESRL and cover the period from March 1958 to April 2018. CO2 is expressed as a mole fraction in dry air (micromol/mol), abbreviated ppm.
These data are a typical time series, one of the most common data types. One powerful yet simple method for analyzing and predicting periodic data is the additive model. The idea is straightforward: represent a time series as a combination of patterns at different scales (daily, weekly, seasonal, and yearly), along with an overall trend.
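As a minimal sketch of this idea (synthetic data with made-up coefficients, not the CO2 series), an additive series can be assembled from exactly those pieces:

```python
import numpy as np
import pandas as pd

# Additive model: observation = trend + seasonal cycle + noise
rng = np.random.default_rng(0)
t = np.arange(120)                          # 120 monthly observations
trend = 0.1 * t                             # slow linear rise
seasonal = 2 * np.sin(2 * np.pi * t / 12)   # 12-month cycle
noise = rng.normal(scale=0.3, size=t.size)  # random residual
y = pd.Series(trend + seasonal + noise,
              index=pd.date_range('2000-01-01', periods=120, freq='MS'))
```

Decomposition, which we apply to the real CO2 data below, is simply this construction run in reverse.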
In this notebook, we will introduce some common techniques used in time-series analysis and walk through the iterative steps required to manipulate and visualize time-series data.
In [11]:
import pandas as pd
import statsmodels.api as sm
from matplotlib import pyplot as plt
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
# Set some parameters to apply to all plots. These can be overridden
import matplotlib
# Plot size to 12" x 7"
matplotlib.rc('figure', figsize = (12, 7))
# Font size to 14
matplotlib.rc('font', size = 14)
# Do not display top and right frame lines
matplotlib.rc('axes.spines', top = False, right = False)
# Remove grid lines
matplotlib.rc('axes', grid = False)
# Set background color to white
matplotlib.rc('axes', facecolor = 'white')
In [2]:
co2 = pd.read_csv('data/co2_mm_mlo.txt',
                  skiprows=72,
                  header=None,
                  comment='#',
                  delim_whitespace=True,
                  names=['year', 'month', 'decimal_date', 'average', 'interpolated', 'trend', 'days'],
                  na_values=[-99.99, -1])
co2['Date'] = co2['year'] * 100 + co2['month']
co2['Date'] = pd.to_datetime(co2['Date'], format='%Y%m')
co2.set_index('Date', inplace=True)
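As a quick sanity check (not part of the original data pipeline), the year*100 + month encoding used above can be parsed for a single value:

```python
import pandas as pd

# 1958*100 + 3 -> 195803, which '%Y%m' parses as March 1958
stamp = pd.to_datetime(str(1958 * 100 + 3), format='%Y%m')
print(stamp)  # 1958-03-01 00:00:00
```

Note that '%Y%m' assigns each month to its first day, which is why the index below lands on the 1st of every month.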
In [3]:
co2.drop(["year", "month", "decimal_date", "interpolated", "trend", "days"], axis=1, inplace=True)
In [4]:
co2.head()
Out[4]:
Real-world data tend to be messy. Data can have missing values for a number of reasons, such as observations that were not recorded and data corruption. Handling missing data is important because many data analysis algorithms do not support data with missing values.
The simplest way to reveal missing data is the isnull() method.
In [5]:
co2.isnull().sum()
Out[5]:
There are 7 months with missing values in our time series.
The simplest strategy for handling missing data is to drop the records that contain a missing value. Pandas provides the dropna() function, which can drop either columns or rows with missing data. The syntax for dropping rows with missing values looks like: dataset.dropna(inplace=True).
However, if the missing values are not too numerous, we should "fill them in" so that we don't have gaps in the data. This can be done using the fillna() method in pandas. The main filling strategies are forward filling (propagate the last valid observation forward) and backward filling (use the next valid observation to fill the gap).
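As a small illustration (toy values, not the CO2 series itself), the two fill directions behave as follows:

```python
import numpy as np
import pandas as pd

# A three-point series with one gap in the middle
s = pd.Series([315.7, np.nan, 317.5])

# Forward fill: carry the last valid observation forward
print(s.ffill().tolist())  # [315.7, 315.7, 317.5]

# Backward fill: pull the next valid observation backward
print(s.bfill().tolist())  # [315.7, 317.5, 317.5]
```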
For simplicity, missing values in the CO2 time series are filled with the next non-null value (a backward fill), although it is worth noting that a rolling mean would sometimes be preferable.
In [6]:
co2 = co2.bfill()
Now the number of missing values should be 0.
In [7]:
co2.isnull().sum()
Out[7]:
In [8]:
co2.plot(title='Monthly CO2 (ppm)')
Out[8]:
From the plot above, there appears to be a linear trend, though it is hard to be sure by eye alone. There is also an obvious seasonal pattern, and the amplitude (height) of the cycles appears stable, suggesting that an additive model is suitable.
We can also visualize our data using a method called time-series decomposition. As its name suggests, time series decomposition allows us to decompose our time series into three distinct components: trend, seasonality, and noise.
In [9]:
decomposition = sm.tsa.seasonal_decompose(co2, model='additive')
fig = decomposition.plot()
Each component of the decomposition is accessible via its own attribute: decomposition.trend, decomposition.seasonal, and decomposition.resid.
For example, we can check the trend in 1991.
In [10]:
decomposition.trend['1991']
Out[10]:
References:
- Seabold, Skipper, and Josef Perktold. "Statsmodels: Econometric and Statistical Modeling with Python." Proceedings of the 9th Python in Science Conference, 2010.
- McKinney, Wes. "Data Structures for Statistical Computing in Python." Presented at SciPy 2010.
- McKinney, Wes. "pandas: a Foundational Python Library for Data Analysis and Statistics." Presented at PyHPC 2011.
- http://www.statsmodels.org/dev/generated/statsmodels.tsa.seasonal.seasonal_decompose.html
- https://climatedataguide.ucar.edu/climate-data-tools-and-analysis/trend-analysis